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Abstract 

Background: A real-time peptide-spectrum matching (RT-PSM) algoritlim is a database searcli metliod to interpret 
tandem mass spectra (MS/MS) witin strict time constraints. Restricted by tlie liardware and arcliitecture of individual 
workstation, previous RT-PSM algorithms either are not fast enough to satisfy all real-time system requirements or 
need to sacrifice the level of inference accuracy to provide the required processing speed. 

Results: We develop two parallelized algorithms for MS/MS data analysis: a multi-core RT-PSM (MC RT-PSM) 
algorithm which works on individual workstations and a distributed computing RT-PSM (DC RT-PSM) algorithm 
which works on a computer cluster. Two data sets are employed to evaulate the performance of our proposed 
algorithms. The simulation results show that our proposed algorithms can reach approximately 216.9-fold speedup 
on a sub-task process (similarity scoring module) and 84.78-fold speedup on the overall process compared with a 
single-thread process of the RT-PSM algorithm when 240 logical cores are employed. 

Conclusions: The improved RT-PSM algorithms can achieve the processing speed requirement without sacrificing 
the level of inference accuracy. With some configuration adjustments, the proposed algorithm can support many 
peptide identification programs, such as XITandem, CUDA version RT-PSM, etc. 



Background 

Tandem mass spectrometry (MS/MS) has been widely 
used in the early detection of diseases, chemical analysis 
and pharmaceutical industry. It can efficiently identify 
and characterize the protein component information 
in complex biological mixtures. Interpretations of MS/MS 
spectra need to perform peptide-spectrum matches (PSMs) 
by searching experimental MS/MS spectra against a protein 
sequence database. 

In order to improve the efficiency and the accuracy of 
MS/MS experiments, a real-time peptide identification 
procedure needs to be involved in a mass spectrometry 
system which analyzes peptides and performs the PSMs 
in a peptide identification procedure life-circle. Wu et al. 
[1] have proposed a pretty fast procedure, called real-time 
PSM (RT-PSM). The key component is "identifying 
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peptides", which is performed by a software application 
[1]. However, this RT-PSM procedure does not include 
any external software controlling features. Although the 
method is fast, further experiments indicate that the 
programming still cannot completely satisfy all real- 
time system requirements, since it is a single-thread 
program that runs on a single workstation. 

As a real-time system, the time window of each peptide 
identification procedure is limited by the spectrum acquir- 
ing time of mass spectrometers. It could be between 
0.05 second to 0.5 second due to different mass spectrom- 
eters. To fit in the narrow time window, using parallel 
computation to improve the speed of PSMs is becoming a 
trend. Duncan et al. [2] develop a program called Parallel 
Tandem by using a computer cluster. It processes MS/MS 
in parallel by using XITandem and a computer cluster 
with Parallel Virtual Machine (PVM) or Message Passing 
Interface (MPI). Sadygov et al. [3] develop the parallel ver- 
sion of SEQUEST, which is also based on the PVM in a 
computer cluster. Diament et al. [4] further develop a faster 
SEQUEST, called Tide, to speed up the performance of the 



© 2014 Sun et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain 
Dedication waiver (http://creativecommons.0rg/publicdomain/zero/l.O/) applies to the data made available in this article, 
unless otherwise stated. 



Sun et ol. Proteome Science 2014, 12:18 
http://www.proteonnesci.conn/content/1 2/1 /1 8 



Page 2 of 9 



SEQUEST. It acheives up to 170 times faster than 
SEQUEST. Zhang et al. [5,6] use SIMD instructions in 
a single workstation to develop programs for improving 
the speed of peptide identification procedures. Graumann 
et al. [7] recently develop a framework of intelligent agent, 
termed MaxQuant Real-Time, which is implemented in 
the MaxQuant computational proteomics environment. 
The framework is especially uesful for new instrument 
types, such as the quadrupole-Qrbitrap. 

No matter using a computer cluster or a single work- 
station, the principles of parallel computing are identical: 
dividing a large sequential process into several inde- 
pendent sub-processes and executing the sub-processes 
concurrently to reduce the execution time [8]. However, 
those previous parallel computing methods [1-7] still have 
some room to be improved. In terms of processing time, 
parallel forms of XITandem [2] and SEQUEST [3] spend 
more time than the RT-PSM algorithm proposed in [1] 
when analyzing individual spectra. Although the Tide [4] 
is already very fast, the speed can still be improved. In 
terms of computing environments, SIMD instructions 
are restricted by the CPU L2 Cache [9,10], which often 
needs to sacrifice the level of inference accuracy to 
achieve the time limitation of a real-time system, while 
a computer cluster circumvents this problem. More- 
over, instead of design a specific program, we aim to 
develpe a general platform that can support many peptide 
identification programs. 

In this paper, we develop an improved peptide identi- 
fication procedure on a computer cluster based on the 
RT-PSM algorithm proposed by Wu et al. in [1]. Two 
parallel algorithms are developed in this study: a multi-core 
RT-PSM algorithm (MC RT-PSM) which works on an indi- 
vidual workstation in form of a multi-thread program and a 
distributed computing RT-PSM algorithm (DC RT-PSM) 
which works on a computer cluster in form of a distributed 
computing program. The DC RT-PSM is built by using the 
parallelized MC RT-PSM procedure, which allocates and 
manages task computating resources through a head node 
in the distrubuted computing procedure. Source code of 
the DC RT-PSM algorithm and sample data are available in 
the Additional file 1. The improved algorithms can achieve 
processing speed requirements without sacrificing the level 
of inference accuracy. 

Results and discussion 

Experimental environment and data sources 

The experimental computer cluster consists of one head 
node and 32 worker nodes, which is connected with 1 
Gigabit Ethernet. Each node has 8 logical CPU cores. 

Two datasets are employed to test the improved al- 
gorithms in this study. Dataset A is the one used in the 
RT-PSM package [1]: the MS/MS spectrum experimental 
data source includes 2058 group spectrum data and the 



protein database is taken from a subset of the UniReflOO 
human protein database. It contains over 2200 entries 
(over 180000 peptide sequences). Dataset B includes 16463 
groups of experimental spectrum data and over 3300 
entries. It is also generated from the UniReflOO human 
protein database. 

The level of inference accuracy for the improved algorithms 

The purpose of the parallel computing processing in our 
improved algorithms is to reduce the peptide identification 
time. Hence, the proposed algorithms do not gain better 
performance by decreasing the level of inference accuracy. 
The results of the improved algorithms should be identical 
with the original RT-PSM in [1]. We randomly choose 
100 groups of experimental data from the results of our 
improved alogirthms and the original RT-PSM. The 
identification results are in excellent agreement between the 
original RT-PSM program and our improved MC RT-PSM 
and DC RT-PSM programs. 

The time speedup of the 2-Dimensional peptide database 
search method 

Rather than using binary search method to query the 
peptide sequence database, we propose to employ the 
2-Dimensional peptide database search method. Although 
this new search method does not improve the accuracy of 
candidate peptide selection, it can speed-up the procedure 
to a certain degree. For example, in Dataset A, the new 
search method makes the similarity-scoring module spent 
less than 6.7% execution time. The detail information of 
time spending is shown in Table 1. 

The time speedup of the MC RT-PSM procedure 

As expected, the performance of MC RT-PSM mainly 
depends on the speed of CPU frequency and the number 
of logical cores of the CPU. The MC RT-PSM is tested on 
four different computers. Table 2 shows the detail informa- 
tion of those computers' CPUs. 

In terms of the time speedup, we compare the MC 
RT-PSM with the RT-PSM proposed by Wu et al. in [1]. 
Figure 1 illustrates the time speedup of the MC RT-PSM 
procedure. The numerical experiments are conducted 
on the same dataset (Dataset A). When using one 
single-thread, the MC RT-PSM achieves about 5-fold 
speedup than the RT-PSM proposed by Wu et al. [1]. 
When increasing the thread number to eight, the MC 



Table 1 Comparisons between the binary search method 
and the 2-dimensional peptide database search method 



Method 


Spectra number 


Time (ms) 


Binary search method 


2058 


8.043 


2-Dimensional peptide 


2058 


7.668 


database search method 
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Table 2 The detail information of experiment hardware 
environments 



Name 


CPU Threads 


HT 


Usage 




WS1 


\7 3770 8 


YES 


Personal server 




WS2 


i5 750 4 


NO 


Development PC 


a. 


WS3 


XEONE5410 8 


NO 


Worker node of cluster 


66 du 


WS4 


\7 2720QM 8 


YES 


Personal computer 


C/5 



RT-PSM achieves about 25 to 34-fold speedup than 
the RT-PSM procedure. 

The time speedup of the DC RT-PSM procedure 

The performance of the DC RT-PSM is compared with 
the single- thread MC RT-PSM procedure. Two tasks are 
designed for comparisons. Task 1 is to search 2058 spec- 
tra against 2200 protein entries that conducted on Data- 
set A. Task 2 is to search 16463 spectra against 3300 
protein entries that conducted on Dataset B. 

The comparison is first done on the similarity-scoring 
module, which is the core part of those algorithms. For 
taks 1, the DC RT-PSM is 53.64-fold speedup with 80 
threads (10 worker nodes), 105.11-fold speedup with 160 
threads (20 worker nodes) and 124.91 -fold speedup with 
240 threads (30 worker nodes) compared with the single- 
thread MC RT-PSM program. For taks 2, the DC RT-PSM 
is 69.09-fold speedup with 80 threads, 155.37-fold speedup 
with 160 threads and 216.90-fold speedup with 240 threads 
compared with the single-thread MC RT-PSM program. 
The results are shown in Figure 2. Generally, an ideal 
parallel computing algorithm is able to gain k-fold 
speedup when a task is allocated into k threads. The 
performance of DC RT-PSM is close to the theoretical 
performance, especially when it is used to search a large 
scale database (such as task 2 in our experiments), which 
is quite promising. 
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Figure 2 The speedup of execution time of the similarity-scoring 
module for the DC RT-PSM compared with MC RT-PSM algorithm 
in two experiment databases. 



The comparison is then done on the overall performance 
of those programs. For task 1, no matter whether 80, 
160 or 240 threads are allocated, the whole time spent 
by the DC RT-PSM is about 11-fold speedup compared 
with the single MC RT-PSM program. For task 2, the 
DC RT-PSM is 48.44-fold speedup with 80 threads, 
67.82-fold speedup with 160 threads and 84.78-fold 
speedup with 240 threads compared with the single- 
thread MC RT-PSM program. The results are shown in 
Figure 3. The decreased fold speedup of the overall 
performances of DC RT-PSM is due to the fact of sys- 
tem natures. In the DC RT-PSM program, the task ini- 
tial time and node message communication time are 
fixed, even worker nodes are connected with 1 Gigabit 
Ethernet. The time spent by those processes is about 
2.0 seconds to 2.3 seconds. Therefore, if the experi- 
mental spectrum dataset is too small, the number of 
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Figure 1 The speedup of execution time of the similarity-scoring module for the MC RT-PSM compared with RT-PSM algorithm 
proposed by Wu et al. [1] in four experiment computers. 



Sun et al. Proteome Science 2014, 12:18 
http://www.proteonnesci.conn/content/1 2/1 /1 8 



Page 4 of 9 



"Experiment Dataset A 



"Experiment Dataset B 



200 































84.78 








57.82 




00 


^11.05 




11.05 


11.05 



50 



100 150 
Thread Number 



200 



250 



Figure 3 The overall speedup of execution time for the DC 
RT-PSM compared with MC RT-PSM algorithm in two 
experiment databases. 



nodes allocated in a task could barely affect the total 
execution time, just like the performance of task 1 shows 
in Figure 3. Even though, the time speedup from DC 
RT-PSM algorithm is still promissing that can satisfies 
almost all real-time system requirements. 

Conclusions 

In this paper, we have proposed an MC RT-PSM algorithm 
which works on an individual workstation and a DC RT- 
PSM algorithm which works on a computer cluster for 
interpreting MS/MS spectra. The MC RT-PSM algorithm 
is an extension of the single-thread RT-PSM algorithm pro- 
posed by Wu et al. in [1], while the DC RT-PSM algorithm 
is a distributed parallel computing algorithm that allocates 
and manages cluster worker nodes to perform the MS/MS 
spectrum analysis. 

One advantage of our proposed method is that it is a gen- 
eral platform of parallel computing, since many current 
parallel algorithms are either not fast enough for all real- 
time MS/MS systems or restricted to specific computing 
environments. The distributed computing algorithm is de- 
signed not only for this RT-PSM algorithm but also for 
other similar algorithms. It can support many other peptide 
identification programs with some configuration adjust- 
ments, such as XITandem, SEQUEST, SIMD version RT- 
PSM, etc.. The other advantage of our method is that it can 
speed up the searching time. The proposed DC RT-PSM al- 
gorithm can reach the real-time constraints of most MS/ 
MS systems without sacrificing the level of inference 
accuracy. 

Methods 

The performance of the RT-PSM program 

The RT-PSM program proposed by Wu et al. [1] is a 
single-thread program. It contains four main steps. The 



first step is to load the peptide database and raw experi- 
mental spectrum data. The second step is to select candi- 
date peptides from the peptide database. The masses of 
candidate peptides are those in the range of the experi- 
mental spectrum. Once a group of candidate peptides are 
selected, scores of every peptide-spectrum pairs are calcu- 
lated in the third step (the similarity-scoring module). In 
the last step, after the program computes the statistical 
significance of the highest similarity score for each group, 
the final results are displayed. The workflow of the 
RT-PSM algorithm is shown in Figure 4. 

The similarity-scoring module is the most time- 
consuming part in the RT-PSM program. It consumes 
over 95% CPU time in profiling experiments [5]. This 
is due to the fact that each spectrum has to be com- 
pared with the whole set of candidate peptides, which 
could easily contain thousands of peptide sequences. 
Hence, it is critical to reduce the computing time of 
the similarity-scoring module in terms of satisfying 
the time constraint of a real-time system. 

In this paper, we develop both a multiple core com- 
puting algorithm and a distrubited algorithm to speedup 
the performance of the RT-PSM program. The com- 
parison is made betweet our algorithm and the RT- 
PSM algorithm in [1]. The definition of sensitivity and 
specificity in [1] refer from the textbook [11], which 
are different from currently widely accepted formula. 
We ignore the name of sensitivity and spedificity, but 
employ the evalutation formula used in [1] to carry out 
our comparisons. 

The similarity-scoring module of the RT-PSM program 

Given an experimental spectrum, the similarity-scoring 
module searches the database of candidate peptides to 
find the best matched one according to a similarity 
score. The similarity score is calculated by comparing 



Loading the peptide database and spectra data 

1 



qkfind(): Selecting candidate peptide from peptide database 



cscore(): Scoring the peptide-spectrum matches 



pscore(): Reprocessing the scores by statistical significance 



end 



Figure 4 The workflow of the RT-PSM algorithm proposed by 
Wu etal. [1]. 
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Table 3 Types of fragment ions and their m/z values in 
the RT-PSM program 



Ion type 


m/z value 




b 


b^-HsO 


b-18 


b+-NH3 


b-17 


b^-C0(a1 


b-28 




y 




y-18 


y+-NH3 


y-17 


y+-NH(z+) 


y-15 



The signs '+' in superscript of the letter b (and y) denote the singly charge 
positive ions. The symbol b (and y) without any superscript denote the m/z 
value of a b-(or y-)ion with a single positive charge. The symbol '-H2O', '-NH3', 
'-CO' and '-NH' represent an ion lose a small molecule of 'H2O', 'NH3', 'CO' 
and 'NH', respectively. 



the difference of m/z values of ions between the experi- 
mental MS/MS spectrum and a theoretical spectrum 
of a peptide in the candidate database. Eight kinds of 
fragment ions are considered in the RT-PSM program, 
which are listed in Table 3. 

Generally, it is not necessary to search the whole 
database for finding the best-matched candidate pep- 
tide. The mass difference between an experimental 
peptide and its matched candidate peptide is often very 
small. A nearest neighbor search (NNS) is employed 
in the RT-PSM algorithm. Suppose Mm is the mass of 
an experimental peptide and t is the tolerance range of 
the NNS. Only those candidate peptides with mass 
range between M^ - 1 and M^ + 1 need to be consid- 
ered. The RT-PSM program proposed by Wu et al. [1] 
employs the most common binary search method to 
perform the NNS [12]. The time complexity of the 
binary search is O (log n). Hence, the time spending 
on the peptide search is related to the size of peptide 
database. 

In this study, we propose to employ an 2-Dimensional 
peptide database search method to decrease the searching 



time. The method is described as follows. First, in- 
stead of treading the whole peptide database as a 
large array, the database is separated into a series of 
small peptide groups (indices) according to the inte- 
ger part of the peptide mass. After that, each subset 
is indexed according to the integer part of the mass 
value. The integer peptide database is a sorted col- 
lection containing indexed sub-databases as shown 
in Figure 5. 

With this improved data structure of peptide data- 
base, the peptide searching consists of two steps. The 
first step is to search if the integer part X of the target 
peptide mass with tolerance value t is indexed by the 
peptide database (X ± t). If the value is found, then the 
first record in the indexed sub-array is the matched 
peptide, and the time complexity of this step is 0(2 t). 
If the first step cannot find a matched peptide and the 
database also contains a subset with index (X-1), then 
the second step is using the binary search method to 
check if this subset contains the matched peptide. The time 
complexity of the second step is 0(log(subset length)). 
The pseudo code of the 2-dimensional peptide database 
search method for peptide database searching is shown 
in Algorithm 1. 

In terms of the time consuming for each peptide P in 
the candidate peptide group, the scoring time t^ is. 

n 

tk = ^(tii+t2i), 

i=l 

where n is the number of ion types that are consid- 
ered in the algorithm, tn is the peptide searching time, 
t2i is the peptide scoring time. For each candidate 
peptide group, the total time of the similarity-scoring 
module is 

N 

ttotai = y^tk, 

k=l 

where N is the number of peptides in the group. 



(a) 



1811.00134 



1811.00313 



1813.03339 



1813.04215 



1813.04665 



(b) 



1811 



1812 



1813.03339 



1813.04665 



1812.01587 



1811.00134 



1811.00313 



Figure 5 An improved data structure for searchiing peptide database, (a) The original Ppptide data structure in the form of a large array; 
(b) The improved peptide data structure with the integer part of the mass value as indices. 
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Algorithm 1: 2-Dimensional Peptide Database Search Method 

function integer qkfind(OnePeptideGroup, PeptideDB, tolerance) 
//Multidimensional Search: 

if OnePeptideGroup.mass > PeptideDB. lastRecord.Mass OR 

OnePeptideGroup.mass < PeptideDB. firstRecord.Mass then 
return -1; 

else 

lowBoundary ^ (OnePeptideGroup.mass - tolerance) + 1 ; 
upBoundary ^ (OnePeptideGroup.mass + tolerance); 

for i ^ lowBoundary; i< upBoundary; i++ do 
if PeptideDB haslndex (i) then 

subset ^ PeptideDB.item(i); 

return subSet.firstltem.Index; 
else if PeptideDB hasIndex(lowBounday -1) then 

subSet <— PeptideDB. item(lowBounday -1); 

Index <— BinarySearch(subSet, OnePeptideGroup.mass) 

return Index; 

end function 



Multi-core Computing and Distributed Computing 

The similarity-scoring module in the RT-PSM program 
is a typical CPU-bound computation function, which 
means the computing time of the function is determined 
principally by the speed of CPU. Normally, one proces- 
sor can only execute one function at one time. In order 
to reduce the time consumed for the similarity-scoring 
module, we propose a parallel algorithm that combines 
the advantages of multi-core computing and distributed 
computing to achieve the maximum performance. 

The multi-core RT-PSM (MC RT-PSM) algorithm 

The MC RT-PSM is based on the Hyper-Threading 
Technology (HT Technology), which is a form of simultan- 
eous multi-threading that takes advantage of super scalar 
architecture (multiple instructions operating on separate 
data in parallel) [13]. Based on this technology, the CPU- 
bound computations can execute multiple scoring functions 
concurrently in a single-CPU workstation [14]. The work- 
flow of the MC RT-PSM program is illustrated in Figure 6. 



The maximum number of threads can be used in the 
MC RT-PSM is based on the number of logical processors. 
The pseudo code of MC RT-PSM algorithm is shown in 
Algorithm 2. 

The distributed computing RT-PSM (DC RT-PSM) algorithm 

Similar to MC RT-PSM algorithm, the DC RT-PSM 
algorithm also needs to separate a large task into several 
sub-tasks and executes them concurrently. However, 
they are different in the following two aspects. Firstly, the 
DC RT-PSM algorithm is designed to run on a distributed 
computer, such as a computer cluster, rather than a single- 
CPU workstation. The cluster is a computer system with 
the processing elements connected as a network. The 
Windows HPC SDK package provides a stable and user- 
friendly development environment for us to develop the 
program of the DC RT-PSM algorithm. Secondly, each 
processor has its own memory in the DC RT-PSM pro- 
gram, while all processors access to a shared memory in 
the MC RT-PSM program [15]. 
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Figure 6 The workflow of the MC RT-PSM program. 
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Algorithm 2: The pseudo code of MC RT-PSM algorithm 

class Peptideldentification 

Collection PeptideDB; //peptide database 
Collection ExperimentPeptideData; //experiment data 

function PIF() 

PeptideDBLoading (DBfile, PeptideDB); 
//load peptide database 

ExperimentalPeptideDataLoading(DataFile, ExperimentPeptideData); 
//load experiment data 

MaxThreadNum ^ Maximun number of CPU logical cores //obtain thread number 
InitMultiThreadCalc(MaxThreadNum, PeptideDB , ExperimentPeptideData) ; 
//initial each thread 

StartThreadList(MaxThreadNum, PeptideDB, ExperimentPeptideData); 
//run mukithread RT-PSM 
end function 

function Pipid (PeptideDB, ExperimentPeptideData) 
//RT-PSM algorithm 

Tolerance^— user-defined peptide search tolerance value; 
Collection score; // similarity score list 

for each OnePeptideGroup in ExperimentPeptideData do 
filter (OnePeptideGroup,PeptideDB); 

candidatePeptideData <— qkfind(OnePeptideGroup, PeptideDB, tolerance); 
//2-demensinal peptide search 

for each candidatePeptide in candidatePeptideData do 

msct <— cscore(OnePeptideGroup,candidatePeptideData); 

//similarity scoring function 

score. Add(msct + 1); 

Init Matrix e value; 

if pscore(evalue, score) is true postive then display result; 
//statistical significance function 

end function 

end class 

class MultiThreadCalc // Multithread control class 

void function InitMultiThreadCalc(Thread, PeptideDB, ExperimentPeptideData) 
//initial one thread 
for i<— 0;i<Thread; i++ do 

InitOneThreadPip(PeptideDB , ExperimentPeptideData) ; 

end function 

void function StartThreadList(Thread,PeptideDB, ExperimentPeptideData) 
// execute one thread 

for i<— 0;i<Thread; i++ do 

ThreadList [i] .Pipid(PeptideDB , ExperimentPeptideData) ; ; 

end function 

end class 



In our case of the DC RT-PSM algorithm, the whole 
identification procedure is divided into several sub 
RT-PSM tasks. Results of those computations are 
combined by a head node [16]. Each sub task runs in 
an individual worker node of the cluster. In order to 



achieve the minimum execution time, the head node 
creates, distributes, synchronizes and monitors tasks 
in each worker node. The pseudo code of the distrib- 
uted task management algorithm for the head node is 
shown in Algorithm 3. 
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Algorithm 3: The pseudo code of DC RT-PSM algorithm 



class DCRTPSM 

int MaxNodeNum user-defined maximum number of work nodes; 



function CreateHPCJob 

ICluster cluster <- new Cluster; 

cluster. connectQ; 

a Job <- cluster. Create Job; 

for i <- 0; i < NodeNum; i++ do 

aTask <- cluster. CreateTask;// Create a new task 

set aTask. commandline; 

set aTask. Stdout; 

set aTask. Stderr; 

set aTask. RequirtNodes; 

set aTask.MaximumNumberOfProcessors; 

aJob . AddTask(aTask) ; 

end function 

function TrackHPCJobGobID) 

while aJob is not finished do 

aJob <- cluster. GetJob(JoblD); 

check a Job status; 

handle a Job exceptions; 
end function //handle aJob exceptions; 

function HPCJobResuhCollection() 

end function //collecting results from node i 



end class 
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Figure 7 The workflow of the distributed tasl< management program. 
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The DC RT-PSM algorithm consists of three steps. 
Firstly, it loads the experimental MS/MS spectral data file 
and divides it into a specified number of small files. Each 
smaller file is assigned to a related worker node. Secondly, 
it creates a sub RT-PSM task for each worker node and 
starts all tasks simultaneously. The DC RT-PSM monitors 
all tasks when they are executing. It collects the feedback 
information, including the task status and exceptions. 
Finally, after all tasks are accomplished, the DC RT-PSM 
collects results from each worker node and generates a 
final report as the output. Figure 7 shows the workflow of 
the distributed task management program. 

The time consumption of the similarity scoring func- 
tion in DC RT-PSM algorithm is: 

tj=tij+t2j+t3j,j = l,2,...,r 

where j is the number of worker node, tij, t2j and t^j are 
the peptide searching time, the peptide scoring time and 
the message communication time spend in the J^^ worker 
node. In practice, the time consuming of message commu- 
nication in each thread is fixed. Hence, when processing a 
large dataset, the peptide searching time tij and the peptide 
scoring time t2j contribute a large amount of time consum- 
ing compared with the message communication time. The 
total time consumption of the DC RT-PSM algorithm is: 

Ctai = to+ max{tj},j = l,2,...,r. 

where to is the task initial time. 
Additional file 



Additional file 1: Source code of the DC RT-PSM algorithm and 
sample datasets. 
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